Algorithms for Molecular Biology
Springer Science and Business Media LLC
All preprints are ranked by how well they match Algorithms for Molecular Biology's content profile, based on 15 papers previously published there. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Ingels, F.; Robidou, L.; Martayan, I.; Marchet, C.; Limasset, A.
High-throughput sequence analysis commonly relies on k-mers (words of fixed length k) to remain tractable at modern scales. These k-mer-based pipelines can employ a sampling step, which in turn allows grouping consecutive k-mers into larger strings to improve data locality. Although other sampling strategies exist, local schemes have become standard: such schemes map each k-mer to the position of one of its characters. A key performance measure of these schemes is their density, defined as the expected fraction of selected positions. The most widely used local scheme is the minimizer scheme: given an integer m ≤ k, a minimizer scheme associates each k-mer with the starting position of one of its m-mers, called its minimizer. Being a local scheme, the minimizer scheme guarantees covering all k-mers of a sequence, with a maximal distance between selected positions of w = k − m + 1. Recent works have established near-tight lower bounds on achievable density under standard assumptions for local schemes, and state-of-the-art schemes now operate close to these limits, suggesting that further improvements under the classical notion of density will face diminishing returns. Hence, in this work, we revisit the notion of density and broaden its scope. As a first contribution, we draw a link between density and the distance between consecutive selected positions. We propose a probabilistic model allowing us to establish that the density of a local scheme is exactly the inverse of the expected distance between the positions it selects, under the minimal and only assumption that said distances are identically distributed. We emphasize that our model makes no assumptions about how positions are selected, unlike the classical models in the literature. Our result introduces a novel method for computing the density of a local scheme, extending beyond classical settings.
Based on this analysis, we introduce a novel technique, named multiminimizers, by associating each k-mer with a bounded set of candidate minimizers rather than a single one. The candidate furthest away (in a precise sense defined in the article) is selected. Since the decision is made by taking advantage of a context beyond a single k-mer, this technique is not a local scheme; it belongs to a novel category of meta schemes. Applying the multiminimizer trick to a local scheme reduces its density at the expense of a controlled increase in computation time. We show that this method, when applied to random (hash-based) minimizers and to open-closed mod-minimizers, approaches a density of [Formula], representing, to our knowledge, the first construction converging to this limit. Our third contribution is the introduction of the deduplicated density, which measures the fraction of distinct minimizers used to cover all k-mers of a set of sequences. While this problem has gained traction in applications such as assembly, filtering, and pattern matching, standard minimizer schemes are often used as a proxy, blurring the distinction between the two objectives (minimizing the number of selected positions versus the number of selected minimizers). Although related to the classical notion of density, deduplicated density differs in both definition and suitable constructions, and must be analyzed in its own right, together with its precise connections to standard density. We show that multiminimizers can also improve this metric, but that globally minimizing deduplicated density in this setting is NP-complete, and we instead propose a local heuristic with strong empirical behavior. Finally, we show that multiminimizers can be computed efficiently, and provide a SIMD-accelerated Rust implementation together with proofs of concept demonstrating reduced memory footprints on core sequence-analysis tasks.
We conclude with open theoretical and practical questions that remain to be addressed in the area of density.
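The density-versus-gap relationship described above can be checked empirically with a small simulation. The sketch below is our own illustration, not the authors' code: a hash-based random minimizer with illustrative parameters k = 21, m = 11 selects, for each k-mer, the position of its minimal m-mer under a random order, then compares the density against the inverse of the mean distance between consecutive selected positions.

```python
import random

rng = random.Random(1)
seq = ''.join(rng.choice('ACGT') for _ in range(100_000))
k, m = 21, 11
w = k - m + 1  # number of m-mer positions per k-mer, and the max gap

prio = {}
def h(mmer):
    # lazily assign a random priority to each m-mer (a random order)
    if mmer not in prio:
        prio[mmer] = rng.random()
    return prio[mmer]

# each k-mer starting at i selects the position of its minimal m-mer
selected = sorted({
    min(range(i, i + w), key=lambda p: h(seq[p:p + m]))
    for i in range(len(seq) - k + 1)
})
density = len(selected) / (len(seq) - k + 1)
gaps = [b - a for a, b in zip(selected, selected[1:])]

# covering guarantee: consecutive selected positions are at most w apart,
# and density is (approximately) the inverse of the mean gap
```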
Korchmaros, A.; Hellmuth, M.; Ramirez-Rafael, J. A.; Schmidt, B.; Stadler, P. F.; Thekkumpadan Puthiyaveedu, S.
Horizontal gene transfer (HGT) is an important contributor to evolution. Following Walter M. Fitch, two genes are xenologs if at least one HGT event separates them. More formally, the directed Fitch graph has a set of genes as its vertices, and directed edges (x, y) for all pairs of genes x and y for which y has been horizontally transferred at least once since it diverged from the last common ancestor of x and y. Subgraphs of Fitch graphs can be inferred by comparative sequence analysis. In many cases, however, only partial knowledge about the "full" Fitch graph can be obtained. Here, we characterize Fitch-satisfiable graphs, which can be extended to a biologically feasible "full" Fitch graph, and derive a simple polynomial-time recognition algorithm. We then proceed to show that several versions of finding the Fitch graph with total maximum (confidence) edge-weights are NP-hard. In addition, we provide a greedy heuristic for "optimally" recovering Fitch graphs from partial ones. Somewhat surprisingly, even if ~80% of the information in the underlying input Fitch graph G is lost (i.e., the partial Fitch graph contains only ~20% of the edges of G), it is possible to recover ~90% of the original edges of G on average.
B. Rocha, L.; Adi, S. S.; Araujo, E.
In computational biology, mapping a sequence s onto a sequence graph G poses a significant challenge. One possible approach to tackling this problem is to find a walk p in G that spells a sequence most similar to s. This challenge is formally known as the Graph Sequence Mapping Problem (GSMP). In this paper, we delve into an alternative problem formulation known as the De Bruijn Graph Sequence Mapping Problem (BSMP). Both problems have three variants: changes only in the sequence, changes only in the graph, and changes in both the sequence and the graph. We concentrate on the variant involving changes in the graph. In the literature, when this problem does not allow the De Bruijn graph to induce new arcs after changes, it is NP-complete, as proven by Gibney et al. [4]. However, we reformulate the problem by considering the characteristics of the arcs induced in the De Bruijn graph. This reformulation alters the problem definition, thereby enabling the application of a polynomial-time algorithm for its resolution. Approaching the problem with this arc-inducing characteristic is new, and the algorithm proposed in this work is new to the literature.
Rahman, M. K.; Rahman, M. S.
The genome rearrangement problem computes the minimum number of operations required to sort all elements of a permutation. A block-interchange operation exchanges two blocks of a permutation which are not necessarily adjacent, and in a prefix block-interchange, one block is always a prefix of the permutation. In this paper, we focus on applying prefix block-interchanges to binary and ternary strings. We present upper bounds for grouping and sorting a given binary/ternary string. We also provide upper bounds for a different version of the block-interchange operation, which we refer to as the restricted prefix block-interchange. We observe that our upper bound for restricted prefix block-interchange operations on binary strings is better than those of other genome rearrangement operations for grouping fully normalized binary strings. Consequently, we provide a linear-time algorithm to solve the problem of grouping binary normalized strings by restricted prefix block-interchanges. We also provide a polynomial-time algorithm to group normalized ternary strings by prefix block-interchange operations. Finally, we provide a classification of ternary strings based on the required number of prefix block-interchange operations.
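As a concrete illustration of the operation studied above, here is a minimal sketch (the function name and index convention are ours, not the paper's) of a prefix block-interchange on a string, together with a one-move grouping of the binary string 0101:

```python
def prefix_block_interchange(s, i, j, k):
    """Exchange the prefix block s[0:i] with the block s[j:k],
    where 0 < i <= j < k <= len(s) (the blocks do not overlap)."""
    assert 0 < i <= j < k <= len(s)
    return s[j:k] + s[i:j] + s[0:i] + s[k:]

# grouping "0101" (making equal symbols adjacent) in a single move:
# exchange the prefix "0" with the final "1"
grouped = prefix_block_interchange("0101", 1, 3, 4)  # -> "1100"
```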
Vasei, H.; Foroughmand-Araabi, M.-H.; Daneshgar, A.
Tumor mutation trees are the primary tools to model the evolution of cancer. Not only may some tumor phylogeny inference methods produce a set of trees encoding potential, parallel evolutionary histories, but mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree exhibiting the shared evolutionary processes is of great importance and can be beneficial in many applications. In this study, we present a general setup to study and analyse the problem of finding a best-fit (centroid) tree to a given set of trees, and we use this setup to analyse mutation trees as our main motivation. To this end, let ε : T_n → R^{n×n} be an embedding of labeled rooted trees on n vertices into the space of real square matrices, and let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a closest tree to a given matrix A with respect to ε and L, i.e., a tree T*(A) for which L(ε(T*(A)) − A) is minimized. Within this setup, our candidates for the embedding are the adjacency, ancestry, and distance matrices of trees, and we consider the L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L(ε(T1) − ε(T2)) defines a family of dissimilarity measures, covering previously studied parent-child and ancestor-descendant metrics. We also show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and is NP-hard for the ancestry and the distance embeddings. The weighted centroid tree problem for a given set of k trees is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding matrix set.
In this article we consider uniform weighted sums in which all weights are equal; in particular, the (classical) centroid tree is defined to be a solution when all weights are chosen to be equal to 1/k (i.e., the mean case). Similarly, the ω-weighted centroid tree is a solution when all weights are equal to ω/k. To show the generality of our setup, we prove that the solution sets of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution sets of the consensus tree problem for the parent-child and ancestor-descendant distances, already handled by the algorithms GraPhyC (2018) and TuELiP (2023), respectively. Next, to tackle this problem in some new cases, we provide integer linear programs to handle the nearest mapped tree problem for the ancestry and the distance embeddings, giving rise to solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, to solve the 2-weighted centroid tree problem for the case of the ancestry matrix. We justify the importance of the weighted setup by the pioneering performance of WAncILP2 both in a comprehensive simulation analysis and on a real breast cancer dataset, in which, by finding centroids as representatives of data clusters, we provide supporting evidence that some common aspects of these centroids can be considered suitable candidates for reliable evolutionary information in relation to the original data.
Brand, M.; Tran, N. K.; Spohr, P.; Schrinner, S.; Klau, G. W.
We consider the homo-edit distance problem: the minimum number of homo-deletions or homo-insertions needed to convert one string into another. A homo-insertion is the insertion of a string of equal characters into another string; a homo-deletion is the inverse operation. We show how to compute the homo-edit distance of two strings in polynomial time: we first demonstrate that the problem is equivalent to computing a common subsequence of the two input strings with a minimum number of homo-deletions, and then present a dynamic programming solution for the reformulated problem. 2012 ACM Subject Classification: Applied computing → Bioinformatics; Applied computing → Molecular sequence analysis; Theory of computation → Dynamic programming
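The full DP is not spelled out here, but one natural ingredient, the minimum number of homo-deletions needed to erase a string entirely, admits a compact interval DP. The sketch below is our own illustration of that subproblem, not the authors' algorithm:

```python
from functools import lru_cache

def min_homo_deletions(s):
    """Minimum number of homo-deletions (each removes a contiguous run of
    equal characters) needed to erase s entirely."""
    @lru_cache(maxsize=None)
    def erase(i, j):  # cost of erasing s[i..j]
        if i > j:
            return 0
        best = 1 + erase(i + 1, j)  # delete s[i] in its own operation
        for k in range(i + 1, j + 1):
            if s[k] == s[i]:  # defer s[i] and merge it into s[k]'s deletion
                best = min(best, erase(i + 1, k - 1) + erase(k, j))
        return best
    return erase(0, len(s) - 1)
```

For example, "aabbaa" needs only two homo-deletions: remove "bb", then the merged run "aaaa".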
Martayan, I.; Cazaux, B.; Limasset, A.; Marchet, C.
In this paper, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with the Elias-Fano scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its combined abilities position it as a flexible Swiss-knife structure for k-mer set management. Availability: https://github.com/imartayan/CBL
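The canonical rotation at the heart of CBL can be illustrated in a few lines. This is a naive O(k²) sketch of the idea, written by us for clarity (CBL itself uses an efficient implementation; Booth's algorithm computes the same rotation in linear time):

```python
def smallest_rotation(kmer):
    """Return the lexicographically smallest cyclic rotation of kmer."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

# all rotations of a k-mer share one canonical representative,
# which exposes the lexicographic redundancy CBL exploits
assert smallest_rotation("GATT") == smallest_rotation("TTGA") == "ATTG"
```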
Wolff, J.; Backofen, R.; Gruening, B.
Single-cell Hi-C interaction matrices are high-dimensional and very sparse. To cluster thousands of single-cell Hi-C interaction matrices, they are flattened and compiled into one matrix. This matrix can, depending on the resolution, have a few million or even billions of features, and any computation with it is therefore memory-demanding. A common approach to reducing the number of features is to compute a nearest neighbors graph. However, the exact Euclidean distance computation is in O(n²), and therefore we present an implementation of an approximate nearest neighbors method based on locality-sensitive hashing running in O(n). The presented method is able to process a 10 kb single-cell Hi-C data set with 2500 cells using 53 GB of memory, while the exact k-nearest neighbors approach is not computable even with 1 TB of memory.
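As a toy illustration of the locality-sensitive-hashing idea (signed random projections for cosine similarity; this is our sketch, not the paper's implementation, which targets sparse Hi-C matrices at scale):

```python
import random

rng = random.Random(0)
DIM, NBITS = 100, 16
# one random hyperplane per signature bit
planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NBITS)]

def lsh_signature(vec):
    """Bit signature of a sparse vector {index: value}; similar vectors
    agree on most bits with high probability, so candidate neighbors can
    be found by bucketing signatures instead of exact distances."""
    return tuple(int(sum(p[i] * v for i, v in vec.items()) > 0)
                 for p in planes)

v = {3: 1.0, 17: 2.5, 42: -1.0}
# positive scaling preserves every projection's sign, hence the signature
assert lsh_signature(v) == lsh_signature({i: 2 * x for i, x in v.items()})
```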
Kille, B.; Groot Koerkamp, R.; McAdams, D.; Liu, A.; Treangen, T.
Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e., have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by [Formula]. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound. Availability and implementation: Minimizer implementations: github.com/RagnarGrootKoerkamp/minimizers; ILP and analysis: github.com/treangenlab/sampling-scheme-analysis
Diseth, A. C.; Puglisi, S. J.
Given a sequence S of subsets of symbols drawn from an alphabet of size σ, a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler transform (SBWT) lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space, less than 3 bits per k-mer. Our experiments show that this translates to new Pareto-optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
Sanaullah, A.; Villalobos, S.; Zhi, D.; Zhang, S.
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear-reference-genome-based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome-graph-based haplotype sets are not trivially generalizable from haplotype matches in linear-reference-genome-based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph; the graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run-length compressed BWT with compressed local alphabets. Although count and locate queries that are efficient in practice were provided for the GBWT by its original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome-graph-based haplotype sets by generalizing from haplotype matches in linear-reference-genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as on the GBWT, due to their similarity.
Truszkowski, J. M.; Gascuel, O.; Swenson, K.
Given trees T and T* on the same taxon set, the transfer index φ(b, T*) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T*. Recently, Lemoine et al. [14] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein's bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b ∈ T in O(n log³ n) time, which improves upon the current O(n²)-time algorithm by Lin, Rajan and Moret [15]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
Jain, C.; Gibney, D.; Thankachan, S. V.
Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in O(n) time, where n denotes the count of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal anchored edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that the optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves a correlation coefficient above 0.9 with edit distance for closely as well as distantly related sequences.
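For intuition, co-linear chaining can be sketched as a naive quadratic DP. The version below is our simplified illustration, maximizing total anchored length with no overlaps and no gap costs; the paper's contribution is precisely handling overlaps and gap costs while avoiding this quadratic behavior:

```python
def chain_score(anchors):
    """anchors: (x, y, length) exact matches between two sequences.
    Returns the maximum total anchored length over all co-linear chains,
    where chained anchors must follow each other in both sequences."""
    a = sorted(anchors)
    best = []
    for i, (x, y, l) in enumerate(a):
        s = l
        for j, (xj, yj, lj) in enumerate(a[:i]):
            if xj + lj <= x and yj + lj <= y:  # a[j] precedes a[i] in both
                s = max(s, best[j] + l)
        best.append(s)
    return max(best, default=0)

# the middle anchor conflicts with the other two in the second sequence
score = chain_score([(0, 0, 3), (2, 1, 5), (4, 4, 2)])  # -> 5
```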
Kunzmann, P.
Alignment searches are fast heuristic methods to identify similar regions between two sequences. This group of algorithms is ubiquitously used in a myriad of software to find homologous sequences or to map sequence reads to genomes. Often the first step in alignment searches is k-mer decomposition: listing all overlapping subsequences of length k. This article presents a simple integer representation of k-mers and shows how a sequence can be quickly decomposed into k-mers in constant time with respect to k.
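The idea admits a very short sketch: with a 2-bit code per nucleotide, the integer for each k-mer is obtained from the previous one with a shift, an OR, and a mask, i.e., in time independent of k. This is our minimal reconstruction of the technique, not the article's implementation:

```python
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmers_as_ints(seq, k):
    """Rolling 2-bit encoding: each k-mer's integer code is derived from
    the previous one in O(1), independent of k."""
    mask = (1 << (2 * k)) - 1  # keep only the last k symbols
    code = 0
    out = []
    for i, c in enumerate(seq):
        code = ((code << 2) | ENC[c]) & mask
        if i >= k - 1:
            out.append(code)
    return out

# "ACGT" with k = 2 yields codes for AC, CG, GT
codes = kmers_as_ints("ACGT", 2)  # -> [1, 6, 11]
```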
Shur, A.; Tziony, I.; Orenstein, Y.
Minimizers are sampling schemes that are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k − 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k − 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers on the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
Yao, H.-T.; Chauve, C.; REGNIER, M.; Ponty, Y.
The problem of RNA design attempts to construct RNA sequences that perform a predefined biological function, identified by several additional constraints. One of the foremost objectives of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metric and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed and jointly analyzed to yield asymptotic upper bounds on the number of designable structures.
Rocha, L. B.; Adi, S. S.; Araujo, E.
In computational biology, mapping a sequence s onto a sequence graph G is a significant challenge. One possible approach to addressing this problem is to identify a walk p in G that spells a sequence which is most similar to s. This problem is known as the Graph Sequence Mapping Problem (GSMP). In this paper, we study an alternative problem formulation, namely the De Bruijn Graph Sequence Mapping Problem (BSMP), which can be stated as follows: given a sequence s and a De Bruijn graph Gk (where k ≥ 2), find a walk p in Gk that spells a sequence which is most similar to s according to a distance metric. We present both exact algorithms and approximate distance heuristics for solving this problem, using edit distance as a criterion for measuring similarity.
Rahmann, S.; Zentgraf, J.
Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: reads with too many differences are not guaranteed to be mapped correctly or at all, because for a genomic position to even be considered, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds, or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), we find the set of best masks that maximize these minima. As a result, we obtain highly robust masks, even for large numbers of changes. Their advantages are illustrated in two ways: first, we provide a new challenge dataset of simulated DNA reads on which current methods like bwa-mem2, minimap2, or strobealign struggle to find seeds, and therefore cannot produce alignments against the human T2T reference genome, whereas we are able to find the correct location from a few unique spaced seeds. Second, we use real DNA data from the highly diverse human HLA region, which we are able to map correctly based on a few exactly matching spaced seeds of well-chosen masks, without evaluating alignments.
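Extracting spaced seeds for a given mask is straightforward; the sketch below (our illustration, with a toy mask, not one of the optimized masks from the paper) keeps the k selected positions from each length-w window:

```python
def spaced_seeds(seq, mask, w):
    """mask: sorted positions within [0, w) to keep from each window."""
    return [''.join(seq[i + p] for p in mask)
            for i in range(len(seq) - w + 1)]

# mask (0, 2) on windows of length 3: the middle position is a
# "don't care", so a substitution there leaves the seed unchanged
seeds = spaced_seeds("ACGTA", (0, 2), 3)  # -> ['AG', 'CT', 'GA']
```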
O'Shea, R. J.
Motivation: Graph canonisation and isomorphism testing are fundamental computational problems whose complexity status has remained unresolved to date. This study examines graph eigenprojections, demonstrating that linear-ordering transformations induce canonical properties therein, yielding polynomial-time canonisation and isomorphism testing in all undirected graphs. Results: This study presents an exact method to identify analogous vertices in isomorphic graphs through comparison of the vertices' eigenprojection matrices, which are shown to be related by a linear permutation. Systematic perturbation strategies are developed to reduce degeneracy whilst conserving isomorphism, through the addition of characteristically weighted self-loops to analogous vertices. Repeated iterations of analogy testing and perturbation deliver canonical vertex labelling and recovery of isomorphic mappings in [Formula] time in all graphs. Analytical proofs are provided to support the claims, and experimental performance is demonstrated on biological and synthetic data, with comparison to a commonly used heuristic algorithm. Availability and Implementation: Source code is provided at github.com/robertoshea/graph_isomorphism. Contact: robert.1.oshea@kcl.ac.uk Supplementary Data: Not applicable.
Schmidt, S.; Toivonen, S.; Medvedev, P.; Tomescu, A. I.
Despite the long history of genome assembly research, there remains a large gap between theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand, and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs, giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the Drosophila melanogaster and Caenorhabditis elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible computational costs and either no or only a small increase in the number of misassemblies.